The Identification and Classification of Unknown Words in Chinese: An N-Grams-Based Approach
Authors
Abstract
In this paper, we propose a new approach to identifying unknown words in Chinese. The approach uses an n-grams program to extract collocating word/character sequences that are possible words and phrases in Chinese. In addition to proposing criteria for identifying new Chinese words, we also classify these new words according to their structural and semantic characteristics. A corpus-based approach to identifying Chinese disyllabic words using mutual information was first studied by Sproat and Shih [1]. Our attempt here is to identify Chinese unknown words through collocations, that is, sequences of words that tend to appear together. We describe a set of statistical techniques for retrieving and identifying unknown words from a Chinese corpus. Here, unknown words are words that are not included in the 90,000-entry CKIP Electronic Dictionary developed at the Institute of Information Science, Academia Sinica. The results retrieved by the n-grams program provide crucial information for updating dictionaries. The n-grams program locates words in context and makes statistical observations to identify collocations. It produces a wide range of collocations, which can be further sub-classified as abbreviational words, derived words, proper names, new words, ambiguous words, and collocating strings. The effectiveness of the n-grams program as a retrieval tool for unknown words is measured and evaluated.
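The abstract does not describe the internals of the n-grams program, so the following is only a minimal sketch of the general idea: count character n-grams in a corpus, keep frequent sequences that are absent from a lexicon, and score adjacent character pairs with pointwise mutual information (the kind of measure the abstract attributes to earlier work). All function names, the toy corpus, the frequency threshold, and the small `lexicon` set (a stand-in for a real dictionary such as the CKIP one) are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def candidate_unknown_words(sentences, lexicon, max_n=3, min_count=2):
    """Count character n-grams (lengths 2..max_n) and keep those that
    occur at least `min_count` times and are not in `lexicon`."""
    counts = Counter()
    for sentence in sentences:
        for n in range(2, max_n + 1):
            counts.update(char_ngrams(sentence, n))
    return {
        gram: count
        for gram, count in counts.items()
        if count >= min_count and gram not in lexicon
    }


def bigram_pmi(sentences):
    """Pointwise mutual information, log2(p(xy) / (p(x) * p(y))),
    for every adjacent character pair in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        unigrams.update(sentence)  # iterating a str yields characters
        bigrams.update(char_ngrams(sentence, 2))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / total_bi
        p_x = unigrams[pair[0]] / total_uni
        p_y = unigrams[pair[1]] / total_uni
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy corpus and lexicon, invented for illustration only.
sentences = ["我愛台北", "他去台北", "台北很大"]
lexicon = {"我", "愛", "他", "去", "很", "大"}

candidates = candidate_unknown_words(sentences, lexicon)
pmi = bigram_pmi(sentences)
for gram, count in candidates.items():
    print(gram, count, round(pmi.get(gram, float("nan")), 3))
```

In this sketch the frequency filter and the lexicon lookup do the retrieval, while the mutual-information score is only attached as extra evidence. On a corpus this small, rare pairs can score as high as genuine words, so a real system would need a much larger corpus and a further classification step of the kind the abstract describes.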
Similar resources
A heuristic method based on a statistical approach for Chinese text segmentation
The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. This method is developed based on statistical information about the association among adjacent characters in Chinese text. Mutual information of bi-grams and significant estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentat...
Design of nonlinear parity approach to fault detection and identification based on Takagi-Sugeno fuzzy model and unknown input observer in nonlinear systems
In this study, a novel fault detection scheme is developed for a class of nonlinear system in the presence of sensor noise. A nonlinear Takagi-Sugeno fuzzy model is implemented to create multiple models. While the T-S fuzzy model is used for only the nonlinear distribution matrix of the fault and measurement signals, a larger category of nonlinear systems is considered. Next, a mapping to decou...
Language identification of person names using CF-IOF based weighing function
Information about the language of origin helps in generating pronunciation for foreign words, specially person names, in a text-to-speech synthesis system. It can be used to apply language specific letter-to-sound (LTS) rules to these words during synthesis. In this paper, we propose a novel approach for using substrings of a person name (called letter N-grams) to identify the language of its o...
CIC-FBK Approach to Native Language Identification
We present the CIC-FBK system, which took part in the Native Language Identification (NLI) Shared Task 2017. Our approach combines features commonly used in previous NLI research, i.e., word n-grams, lemma n-grams, part-of-speech n-grams, and function words, with recently introduced character n-grams from misspelled words, and features that are novel in this task, such as typed character n-gram...
Multi-character Field Recognition for Arabic and Chinese Handwriting
Two methods, Symbolic Indirect Correlation (SIC) and Style Constrained Classification (SCC), are proposed for recognizing handwritten Arabic and Chinese words and phrases. SIC reassembles variable-length segments of an unknown query that match similar segments of labeled reference words. Recognition is based on the correspondence between the order of the feature vectors and of the lexical trans...